Abstract


Entity linking is an indispensable operation for populating knowledge repositories in information extraction. It aligns a textual entity mention with its corresponding disambiguated entry in a knowledge repository. In this paper, we propose a new paradigm named distantly supervised entity linking (DSEL), in which the disambiguated entities of a huge knowledge repository (Freebase) are automatically aligned to their corresponding descriptive webpages (Wiki pages). In this way, a large amount of weakly labeled data can be generated without manual annotation and fed to a classifier for linking newly discovered entities. Compared with traditional paradigms based on a single knowledge base, DSEL benefits from jointly leveraging the respective advantages of Freebase and Wikipedia. Specifically, the proposed paradigm bridges the disambiguated labels (Freebase) of entities and their textual descriptions (Wikipedia) for Web-scale entities. Experiments conducted on a dataset of 140,000 items and 60,000 features achieve a baseline F1-measure of 0.517. Furthermore, we analyze the feature performance and improve the F1-measure to 0.545.


1 Introduction 


To build the “Digital Alexandria Library” for our human race, researchers in the NLP community have dedicated themselves to Information Extraction (Sarawagi, 2008) over the past decades. Information extraction focuses on processing natural language text to produce structured knowledge, which is usually represented as triples (two entities and their relation) for the convenience of storage in a database, retrieval, or even automatic reasoning. For example, if we send a natural language sentence, Michael Jordan visited CMU yesterday, to an information extraction pipeline, it will first be processed by three operations, i.e.,


• Named Entity Recognition (Nadeau and Sekine, 2007): Entities should first be identified and classified into predefined categories, such as person (PER), location (LOC) and organization (ORG). After this operation, the sentence will be annotated as [Michael Jordan]/PER visited [CMU]/ORG yesterday.

• Coreference Resolution (Ng, 2010): Some entities may have aliases or abbreviations. It is well known that CMU is the abbreviation for Carnegie Mellon University. The knowledge repository may only store the regularized name, e.g., Carnegie Mellon University, for this named entity, so coreference resolution is indeed necessary.

• Relation Extraction (Bach and Badaskar, 2007): After both of the named entities ([Michael Jordan]/PER and [Carnegie Mellon University]/ORG) are recognized and regularized, we study the relation between them. In this case, we extract the verb visited and map it to the relation visit. The output is then a triple, i.e., (Michael Jordan [PER], visit, Carnegie Mellon University [ORG]).


So far, we have only abstracted the triple as structured knowledge from the natural language sentence. However, this contributes nothing to increasing the scale of a knowledge repository such as Freebase, an online knowledge base with billions of triples and millions of disambiguated entities that is primarily maintained by Google Inc., because we do not even know which exact Michael Jordan the triple (Michael Jordan [PER], visit, Carnegie Mellon University [ORG]) refers to in Freebase. As illustrated in Figure 1, there are three different persons named Michael Jordan in Freebase and each of them may be the protagonist of that news. Therefore, to populate knowledge repositories (Ji and Grishman, 2011), we need the fourth operation:


• Entity Linking (Rao et al., 2013): It concerns aligning a textual entity mention to the corresponding disambiguated entry in a knowledge repository. More specifically, since there are several Michael Jordans disambiguated by different MIDs (machine identifiers) as illustrated in Figure 1, we may build a classifier that assigns the Michael Jordan in the extracted triple (Michael Jordan [PER], visit, Carnegie Mellon University [ORG]) to the exact named entity in Freebase, or finds out that this Michael Jordan is a newly discovered named entity (NIL).
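As a minimal sketch of the candidate step implied by this operation, the snippet below looks up every MID that shares a mention's name and falls back to NIL when the name is unknown. The index structure, the function names and the placeholder identifier for Carnegie Mellon University are assumptions made for illustration; only the MIDs 054c1 and 0bby3vs come from this paper.

```python
from collections import defaultdict

# Toy entity list: (MID, name, short description).
# "cmu_placeholder" is a made-up identifier, not a real Freebase MID.
ENTITIES = [
    ("054c1", "Michael Jordan", "basketball player"),
    ("0bby3vs", "Michael Jordan", "mycologist"),
    ("cmu_placeholder", "Carnegie Mellon University", "university"),
]


def build_name_index(entities):
    """Map each lower-cased name to the list of MIDs that carry it."""
    index = defaultdict(list)
    for mid, name, _description in entities:
        index[name.lower()].append(mid)
    return index


def candidates(index, mention):
    """Return the candidate MIDs for a mention, or ["NIL"] if none exist."""
    return list(index.get(mention.lower(), [])) or ["NIL"]


NAME_INDEX = build_name_index(ENTITIES)
print(candidates(NAME_INDEX, "Michael Jordan"))
print(candidates(NAME_INDEX, "Jane Roe"))
```

A classifier is then only needed to choose among the returned candidates; the lookup itself does no disambiguation.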


Hachey et al. (2013) and Rao et al. (2013) point out that most of the literature (Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007; Cucerzan, 2007; Milne and Witten, 2008; Ratinov et al., 2011) and the entity linking tracks⁴ in TAC-KBP (McNamee and Dang, 2009; Ji et al., 2010) concentrate on linking ambiguous entities to entries in Wikipedia, whereas our ultimate goal is to populate a structured knowledge repository, e.g., Freebase. However, to the best of our knowledge, few works (Zheng et al., 2012) address disambiguating named entities using Freebase, which contains many more entries but less text information for each entry than Wikipedia.


1 According to the statistics released on 10th March, 2014 by Google Inc., there are about 1.9 billion Freebase triples and 43 million entities.

2 The whole dump of Freebase can be downloaded from https://developers.google.com/freebase/data

3 One can access Freebase and contribute more knowledge.

4 http://www.nist.gov/tac/2013/KBP/EntityLinking/index.html


alias: MJ, His Airness, Mike Jordan
Michael Jeffrey Jordan, also known by his initials, MJ, is an American former professional basketball player, entrepreneur, and majority owner and chairman of the Charlotte Bobcats. His biography on the National Basketball Association website states,... [Wikipedia]

Michael Jordan /m/0bby3vs
Person
Michael Jordan is an English mycologist, author of The Encyclopedia of Fungi of Britain and Europe, founder and chairman of the Association of British Fungal Groups. Jordan founded ABFG in 1996, having observed an upsurge in interest in mushroom... [Wikipedia]

Michael Jordan /m/04ntmc
Soccer Goalkeeper, Athlete, Person, Football player
Michael William Jordan is an English football goalkeeper born in Cheshunt, Hertfordshire. He made seven appearances in the Football League for Chesterfield, having started his career as a trainee at Arsenal. [Wikipedia]


Figure 1: The disambiguated entities with the same name Michael Jordan in Freebase. The entities in Freebase are disambiguated by a unique machine identifier, e.g., the famous basketball player, Michael Jordan, is labeled by 054c1 (MID).


Overall, Hachey et al. (2013) and Zheng et al. (2012) represent two research directions that leverage Wikipedia and Freebase, respectively. As these two collaborative web resources each have their own strengths, i.e., more context information and more disambiguated entities, respectively, we study a new paradigm that could bridge the gap between the two separate repositories and benefit from both. From the perspective of supervised learning, entity linking can be naturally regarded as a classification problem. To build a training dataset for disambiguating a set of entities with the same name, we can first collect the sentences that mention that name from webpages, such as Wiki pages⁵, and then manually annotate each entity mention with its unique machine identifier (MID) in Freebase given the contexts of the sentences that it occurs in. However, hand-labeling data is time consuming and is usually applicable only to some specific classes of entities, such as person (PER), location (LOC) and organization (ORG). Therefore, we look for an approach that avoids this tedious and laborious work.

Inspired by the idea of weak labeling (Fan et al., 2014; Craven et al., 1999), we contribute a new paradigm called distantly supervised entity linking (DSEL) without manual annotation in this paper. More specifically, we take advantage of a heuristic alignment assumption based on crowd sourcing to connect a certain disambiguated entity in Freebase with its related webpages. From these webpages, feature vectors can be extracted from the sentence-level textual contexts of that entity mention and labeled with its corresponding MID in Freebase. In this way, we can produce a large-scale weakly labeled⁶ dataset. Moreover, it is unrealistic to learn a specific classifier for each entity, as there are about 43 million disambiguated entities in Freebase. To tackle these challenges, we propose a strategy of training a general classifier for disambiguating multiple entities and select a well-known classifier, i.e., liblinear (Fan et al., 2008), to self-learn the weights among the high-dimensional sparse and noisy features. Experiments are conducted on a dataset of 140,000 items and 60,000 features. DSEL achieves a baseline F1-measure of 0.517. Furthermore, we analyze the performance of other features, and finally the F1-measure is improved to 0.545.

5 The Wiki page for the famous basketball player, Michael Jordan, is http://en.wikipedia.org/wiki/Michael_jordan

Topic equivalent webpage
http://en.wikipedia.org/wiki/index.html?curid=20455
http://ja.wikipedia.org/wiki/index.html?curid=30336
http://es.wikipedia.org/wiki/index.html?curid=10553
http://de.wikipedia.org/wiki/index.html?curid=32444
http://pt.wikipedia.org/wiki/index.html?curid=71267
http://fr.wikipedia.org/wiki/index.html?curid=50915

Figure 2: The topic equivalent webpages of the famous basketball player, Michael Jordan, in Freebase.

2 Paradigm 

Traditional supervised learning methods for entity disambiguation require tedious manual annotation to build training datasets. Manual annotation is costly, and can only cover some specific categories, e.g., person names (Christen, 2006). Therefore, we explore a paradigm that can automatically generate large-scale open-category training datasets without manual annotation. Based on such a dataset, we aim to build a practical classifier and generalize it to disambiguate more unlinked entity mentions in free text.

Freebase contains 43 million disambiguated entities falling into 76 categories. Each entity is assigned a unique machine identifier (MID). Those MIDs are the natural labels to which newly identified entity mentions are linked. However, Freebase itself offers inadequate free text for extracting features, as it is a well-structured knowledge repository with billions of triples. Therefore, we resort to another free-text corpus that can be distantly supervised by Freebase, and the key challenge is to find the bridge of supervision.

Fortunately, every entity in Freebase maintains a list of links to its topic equivalent webpages via crowd sourcing (Howe, 2006), as shown in Figure 2. These links guide us to the description webpages for that entity. Even though those links cover different languages, we only choose the English Wiki pages to conduct experiments. Overall, we jointly exploit Freebase and Wikipedia to automatically construct the data for training a classifier.
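The alignment described in this section can be sketched as follows. This is only an illustration under stated assumptions: the function names, the naive sentence splitter, the substring match on the entity name and the two-sentence toy page text are all hypothetical, while the URLs are the ones listed in Figure 2.

```python
import re

# Topic equivalent links as listed in Figure 2 (first three shown).
LINKS = [
    "http://en.wikipedia.org/wiki/index.html?curid=20455",
    "http://ja.wikipedia.org/wiki/index.html?curid=30336",
    "http://es.wikipedia.org/wiki/index.html?curid=10553",
]

# Toy stand-in for the text of the English page; not fetched from the web.
PAGE_TEXT = (
    "By acclamation, Michael Jordan is the greatest basketball player "
    "of all time. He played for the Chicago Bulls."
)


def english_wiki_links(links):
    """Keep only the links that point at the English Wikipedia."""
    return [url for url in links if url.startswith("http://en.wikipedia.org/")]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def weakly_label(mid, name, page_text):
    """Label every sentence that mentions `name` with `mid`.

    The labels are weak: nothing checks that the mention really refers to
    the entity, so noise is expected.
    """
    return [(mid, s) for s in split_sentences(page_text) if name in s]


print(english_wiki_links(LINKS))
print(weakly_label("054c1", "Michael Jordan", PAGE_TEXT))
```

Sentences that refer to the entity only by a pronoun are skipped by this sketch, since it matches the surface name.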

3 Feature 

For each entity in Freebase, we find its topic- 
equivalent Wiki page and extract the contextual 
features of its mention at sentence level. 

Generally, we choose K (K = 1, 2, 3) open-class words (Van Petten and Kutas, 1991), namely nouns, verbs, adjectives and adverbs, in front of and behind the given entity mention. If we ignore the order of these words, we obtain the bag-of-words feature; otherwise, we obtain the word sequence feature. Furthermore, we use Stanford NLP core⁷ and add the part-of-speech tagging feature, which may help disambiguate those contextual words. Therefore, for each window of size K surrounding the entity mention, we can extract four different kinds of features, i.e., bag of words (BOW), word sequence (WS), bag of words plus part-of-speech tagging (BOW + POS) and word sequence plus part-of-speech tagging (WS + POS). In total, there are twelve kinds of lexical features.

To illustrate the various kinds of contextual features, we randomly pick a sentence from the Wiki page of the famous basketball player as an example, i.e.,

His biography on the National Basketball Association (NBA) website states, “By acclamation, Michael Jordan is the greatest basketball player of all time.”

The twelve kinds of lexical features for the sentence above are listed in Table 1. We will compare the performance among these features in Section 5.

6 Auto-labeling via crowd sourcing may naturally bring about noise. Therefore, we regard the dataset as weakly labeled.

7 http://nlp.stanford.edu/software/corenlp.shtml

Figure 3: The architecture of DSEL system.

Sentence: His biography on the National Basketball Association (NBA) website states, “By acclamation, Michael Jordan is the greatest basketball player of all time.”

Feature type          Feature vector
BOW (K = 1)           <{acclamation}, {is}>
BOW (K = 2)           <{states}, {acclamation}, {is}, {greatest}>
BOW (K = 3)           <{website}, {states}, {acclamation}, {is}, {greatest}, {basketball}>
WS (K = 1)            <{acclamation}, {is}>
WS (K = 2)            <{states-acclamation}, {is-greatest}>
WS (K = 3)            <{website-states-acclamation}, {is-greatest-basketball}>
BOW + POS (K = 1)     <{acclamation/NN}, {is/VBZ}>
BOW + POS (K = 2)     <{states/NNS}, {acclamation/NN}, {is/VBZ}, {greatest/JJS}>
BOW + POS (K = 3)     <{website/NN}, {states/NNS}, {acclamation/NN}, {is/VBZ}, {greatest/JJS}, {basketball/NN}>
WS + POS (K = 1)      <{acclamation/NN}, {is/VBZ}>
WS + POS (K = 2)      <{states/NNS-acclamation/NN}, {is/VBZ-greatest/JJS}>
WS + POS (K = 3)      <{website/NN-states/NNS-acclamation/NN}, {is/VBZ-greatest/JJS-basketball/NN}>

Table 1: Twelve kinds of lexical features for the given sentence. A pair of angle brackets stands for a feature vector, e.g., <{states}, {acclamation}, {is}, {greatest}>. A feature item is marked by a pair of braces, e.g., {states-acclamation}.
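The window features of Table 1 can be reproduced with a short sketch like the one below. It assumes the sentence has already been tokenized and POS-tagged; the tags of the open-class words are those printed in Table 1, the remaining tags are hand-assigned for illustration, and the function names are hypothetical. Open-class words are approximated as tokens whose tag starts with NN, VB, JJ or RB, which is an assumption of this sketch and leaves the proper-noun mention outside the context.

```python
# Penn Treebank tag prefixes for nouns, verbs, adjectives and adverbs.
OPEN_CLASS_PREFIXES = ("NN", "VB", "JJ", "RB")

# Fragment of the example sentence, starting at "website".
TAGGED = [
    ("website", "NN"), ("states", "NNS"), (",", ","), ("“", "``"),
    ("By", "IN"), ("acclamation", "NN"), (",", ","),
    ("Michael", "NNP"), ("Jordan", "NNP"),
    ("is", "VBZ"), ("the", "DT"), ("greatest", "JJS"),
    ("basketball", "NN"), ("player", "NN"),
]
MENTION_START, MENTION_END = 7, 9  # TAGGED[7:9] is the mention "Michael Jordan"


def is_open_class(tag):
    return tag.startswith(OPEN_CLASS_PREFIXES)


def context_window(tagged, start, end, k):
    """Return up to k open-class (word, tag) pairs on each side of the mention."""
    left = [pair for pair in tagged[:start] if is_open_class(pair[1])][-k:]
    right = [pair for pair in tagged[end:] if is_open_class(pair[1])][:k]
    return left, right


def render(pair, with_pos):
    word, tag = pair
    return f"{word}/{tag}" if with_pos else word


def bow(tagged, start, end, k, with_pos=False):
    """Bag of words: one item per context word (item order carries no meaning)."""
    left, right = context_window(tagged, start, end, k)
    return [render(pair, with_pos) for pair in left + right]


def ws(tagged, start, end, k, with_pos=False):
    """Word sequence: one hyphen-joined item per side of the mention."""
    left, right = context_window(tagged, start, end, k)
    return ["-".join(render(pair, with_pos) for pair in side)
            for side in (left, right) if side]


print(bow(TAGGED, MENTION_START, MENTION_END, 3))
print(ws(TAGGED, MENTION_START, MENTION_END, 2, with_pos=True))
```

Calling `bow` and `ws` with K = 1, 2, 3 and with `with_pos` switched on or off yields the twelve feature kinds.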


4 Implementation 


As we have already automatically produced a training dataset based on the proposed distant supervision paradigm, an intuitive idea is to train a specific classifier for each ambiguous name with its unambiguous MIDs and the corresponding feature vectors. However, Table 2 shows that there are at least 5.5 million names that denote more than one entity (MID) in Freebase. Therefore, it is infeasible to build 5.5 million specific classifiers. To train a general classifier that does not restrict itself to disambiguating a certain name, we adopt a strategy that merges those specific classifiers. Concretely, we transform the MIDs, i.e., the original labels, into features and use 1/0 to indicate whether the contextual features from Wiki pages and the MIDs in Freebase match each other. If we choose the BOW (K = 3) feature in Table 1 for instance, one positive training sample will contain a new feature vector (<{website}, {states}, {acclamation}, {is}, {greatest}, {basketball}, {MID:054c1}>) labeled by 1. To balance the training dataset, we randomly pick features from other entities with the same name to generate negative samples. For example, another well-known Michael Jordan (MID:0bby3vs) is an English mycologist. We can extract a BOW (K = 3) feature vector, i.e., <{is}, {English}, {mycologist}>, and concatenate it with {MID:054c1} to construct a negative sample labeled by 0.
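The label-to-feature transformation above can be sketched as follows. The function names are hypothetical, and drawing one negative per positive is an assumption of this sketch: the text only states that negatives are picked at random from other entities with the same name to balance the data.

```python
import random


def make_sample(context, mid, label):
    """Append the candidate MID as one more feature; label is 1 or 0."""
    return (list(context) + [f"MID:{mid}"], label)


def build_samples(contexts_by_mid, rng):
    """Build training samples for one name collection.

    contexts_by_mid maps every MID sharing the name to the context feature
    vectors extracted from its own Wiki page.
    """
    samples = []
    mids = sorted(contexts_by_mid)
    for mid in mids:
        own = contexts_by_mid[mid]
        # Positive samples: the entity's own contexts paired with its MID.
        for context in own:
            samples.append(make_sample(context, mid, 1))
        # Negative samples: contexts of same-name entities paired with this MID.
        others = [c for other in mids if other != mid
                  for c in contexts_by_mid[other]]
        for context in rng.sample(others, min(len(own), len(others))):
            samples.append(make_sample(context, mid, 0))
    return samples


CONTEXTS = {
    "054c1": [["website", "states", "acclamation",
               "is", "greatest", "basketball"]],
    "0bby3vs": [["is", "English", "mycologist"]],
}
SAMPLES = build_samples(CONTEXTS, random.Random(0))
for features, label in SAMPLES:
    print(label, features)
```

With this encoding a single binary classifier sees samples from every name collection, which is what allows one general model to replace millions of name-specific ones.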

The distant supervision paradigm and the strategy of building the training set for a general classifier lead to high-dimensional, noisy and sparse features. Moreover, given the millions of training samples produced by aligning Freebase and Wikipedia, we choose a linear classifier based on logistic regression, i.e., Liblinear (Fan et al., 2008), to rapidly self-learn the weights among the high-dimensional sparse and noisy features.

For a newly discovered entity mention in the testing corpus, we first extract its contextual feature, e.g., bag of words as above. Then the feature is concatenated with each of the candidate MIDs that share the same name with that entity mention. For each testing sample within the same name collection, the classifier predicts a score indicating the strength of linking. For each collection, the Top-N predictions with the highest probabilities are selected for evaluation.
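The test-time procedure can be sketched as a ranking over candidates. In the snippet below, `rank_candidates` and `toy_score` are hypothetical names; `toy_score` is a keyword-overlap stand-in with hand-written keyword sets and is not the trained logistic regression model, which would supply the probability in practice.

```python
# Hand-written keyword sets, used only to make the example runnable.
KEYWORDS = {
    "054c1": {"basketball", "player"},
    "0bby3vs": {"mycologist", "fungi"},
}


def toy_score(features):
    """Stand-in scorer: fraction of context words that are keywords of the MID.

    The last feature is expected to be the candidate, written as "MID:<mid>".
    """
    words, mid = features[:-1], features[-1][len("MID:"):]
    if not words:
        return 0.0
    return sum(word in KEYWORDS[mid] for word in words) / len(words)


def rank_candidates(context, candidate_mids, score_fn, top_n=1):
    """Pair the context with every candidate MID, score each pair and
    return the top_n (mid, score) pairs, highest score first."""
    scored = [(mid, score_fn(list(context) + [f"MID:{mid}"]))
              for mid in candidate_mids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


CONTEXT = ["is", "greatest", "basketball"]
TOP = rank_candidates(CONTEXT, ["054c1", "0bby3vs"], toy_score, top_n=1)
print(TOP)
```

Passing the scorer in as a function keeps the ranking logic independent of whichever classifier produces the scores.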

We summarize the procedures of implementing our proposed paradigm and use Figure 3 to demonstrate the architecture of the DSEL system.


5 Experiments 


In this section, we report the experimental results following the procedures described in Section 4. To evaluate the performance of different features, we adopt three widely used metrics (Meij et al., 2013), namely precision, recall and F1-measure.


5.1 Dataset 

We randomly select 20,000 ambiguous names (collections) in Freebase. About 82,000 sentences that contain at least one entity mention are extracted from the topic-equivalent Wiki pages. For each collection, 80% of the sentences are randomly picked for constructing the training set and the remaining 20% are for held-out evaluation. Following the procedures of building training samples described in Section 4, we obtain a dataset of around 140,000 items and 60,000 features.
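A per-collection 80/20 split of this kind might look like the sketch below. The function names, the fixed seed and the rounding rule for collections whose size is not a multiple of five are assumptions; the text only specifies the 80%/20% proportions.

```python
import random


def split_collection(sentences, rng, train_fraction=0.8):
    """Shuffle one collection's sentences and cut them into train/held-out."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def split_dataset(collections, seed=0):
    """Apply the split independently to every name collection."""
    rng = random.Random(seed)
    return {name: split_collection(sentences, rng)
            for name, sentences in collections.items()}


# Toy collection with ten placeholder sentences.
COLLECTIONS = {"Michael Jordan": [f"sentence {i}" for i in range(10)]}
SPLITS = split_dataset(COLLECTIONS)
train, held_out = SPLITS["Michael Jordan"]
print(len(train), len(held_out))
```

Splitting inside each collection, rather than over the pooled sentences, keeps every ambiguous name represented in both the training and the held-out part whenever it has enough sentences.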


5.2 Evaluation metrics 

Precision and recall are widely used metrics to evaluate different rank-based approaches to entity linking. F1-measure combines precision and recall by calculating their harmonic mean. Suppose that C denotes the whole collection set for testing, C_{i,j} represents the set of Top-j predictions with the highest probabilities in the i-th collection, and G_i stands for the set of gold standards of the i-th collection. #(S) is the function that counts the entries in set S. Then the formulae to calculate precision, recall and F1-measure are as follows,

\[ \text{Precision} = \frac{\sum_{C_{i,j} \in C} \#(C_{i,j} \cap G_i)}{\sum_{C_{i,j} \in C} \#(C_{i,j})} \]

\[ \text{Recall} = \frac{\sum_{C_{i,j} \in C} \#(C_{i,j} \cap G_i)}{\sum_{C_{i,j} \in C} \#(G_i)} \]

\[ \text{F1-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \]
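A minimal sketch of these metrics is given below, assuming the micro-averaged reading of the definitions above: matches are summed over all collections and divided by the total number of predictions for precision and by the total number of gold entries for recall. The function name and the toy MIDs are hypothetical.

```python
def precision_recall_f1(predictions, gold):
    """Compute precision, recall and F1 over all test collections.

    predictions[i] is the set of Top-j predicted MIDs for collection i and
    gold[i] is the set of gold-standard MIDs for the same collection.
    """
    hits = sum(len(set(predictions[i]) & set(gold.get(i, ())))
               for i in predictions)
    n_predicted = sum(len(set(p)) for p in predictions.values())
    n_gold = sum(len(set(g)) for g in gold.values())
    precision = hits / n_predicted if n_predicted else 0.0
    recall = hits / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example with made-up MIDs: one hit among three predictions, two gold entries.
PREDICTIONS = {"name_a": {"m1", "m2"}, "name_b": {"m3"}}
GOLD = {"name_a": {"m1"}, "name_b": {"m4"}}
P, R, F1 = precision_recall_f1(PREDICTIONS, GOLD)
print(P, R, F1)
```

In this toy case precision is 1/3, recall is 1/2 and their harmonic mean is 0.4; raising j adds predictions, which can only raise recall while typically lowering precision.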

5.3 Feature comparison 

For each type of feature, we conduct one trial and tune the parameters of the logistic classifier using 5-fold cross validation. We then carry out held-out testing on the remaining 20% of the sentences.

# of MIDs with the same name    # of names
2                               4,467,216
3                               740,530
4                               440,261
5                               180,489
6                               134,012
7                               76,459
8                               60,273
9                               41,256
10                              33,628

Table 2: The distribution of ambiguous entities in Freebase.

Feature type          Avg. F1-measure    Feature type         Avg. F1-measure
BOW (K = 1)           0.539              WS (K = 1)           0.544
BOW (K = 2)           0.531              WS (K = 2)           0.532
BOW (K = 3)           0.529              WS (K = 3)           0.518
BOW + POS (K = 1)     0.540              WS + POS (K = 1)     0.545
BOW + POS (K = 2)     0.532              WS + POS (K = 2)     0.531
BOW + POS (K = 3)     0.529              WS + POS (K = 3)     0.517

Table 3: The F1-measure comparison among different features.

(a) Precision-Recall curves for BOW features.
(b) Precision-Recall curves for BOW + POS features.

Figure 4: Precision-Recall curves for the BOW-class lexical features.

(a) Precision-Recall curves for WS features.
(b) Precision-Recall curves for WS + POS features.

Figure 5: Precision-Recall curves for the WS-class lexical features.

Figure 4 and Figure 5 show the precision-recall curves for the twelve lexical features, and Table 3 displays the average F1-measure comparison among different features. We find that, in terms of average F1-measure, the WS-class features outperform the BOW-class features for the shortest window (K = 1) but not for K = 3, and that the short-distance contextual features (K = 1) are more effective than the long-distance ones (K = 2, 3).

6 Conclusion and Future Work 

As far as we know, this is the first attempt to deal with the task of entity linking based on the idea of distant supervision. We leverage a heuristic alignment assumption, i.e., the topic equivalent pages, to bridge the gap between Freebase and Wikipedia and jointly use those two knowledge bases to automatically produce training data without manual annotation. Moreover, we propose a strategy that transforms labels into features and feeds them to a general classifier, rather than building an individualized classifier for each of the millions of ambiguous names.

For future work, we believe that this new paradigm leaves several open questions:

• Besides the entities (MIDs) that have already been stored in knowledge repositories (Freebase), new entity instances (NIL) with the same name need to be discovered. Therefore, further study could focus on extending the paradigm to identify unknown entities.

• Links to many other webpages in different languages are also provided in Freebase, as illustrated in Figure 2. They may facilitate research on cross-lingual entity linking.

• The alignment assumption is simple and heuristic. Further studies may be dedicated to discovering other reasonable alignment principles.

• Even though the strategy for generating training data fits a general classifier, it raises the problem that high-dimensional, sparse and noisy features impact the effectiveness and efficiency of the proposed paradigm.

Generally speaking, the experiments suggest that our newly proposed paradigm is promising and worthy of further study.